Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning
نویسندگان
چکیده
It has recently been shown that deep neural networks (DNN) can improve the quality of statistical parametric speech synthesis (SPSS) when using a source-filter vocoder. Our own previous work has furthermore shown that a dynamic sinusoidal model (DSM) is also highly suited to DNN-based SPSS, whereby sinusoids may either be used themselves as a “direct parameterisation” (DIR), or they may be encoded using an “intermediate spectral parameterisation” (INT). The approach in that work was effectively to replace a decision tree with a neural network. However, waveform parameterisation and synthesis steps that have been developed to suit HMMs may not fully exploit DNN capabilities. Here, in contrast, we investigate ways to combine INT and DIR at the levels of both DNN modelling and waveform generation. For DNN training, we propose to use multi-task learning to model cepstra (from INT) and log amplitudes (from DIR) as primary and secondary tasks. Our results show combining these improves modelling accuracy for both tasks. Next, during synthesis, instead of discarding parameters from the second task, a fusion method using harmonic amplitudes derived from both tasks is applied. Preference tests show the proposed method gives improved performance, and that this applies to synthesising both with and without global variance parameters.
منابع مشابه
Multi-task learning deep neural networks for speech feature denoising
Traditional automatic speech recognition (ASR) systems usually get a sharp performance drop when noise presents in speech. To make a robust ASR, we introduce a new model using the multi-task learning deep neural networks (MTL-DNN) to solve the speech denoising task in feature level. In this model, the networks are initialized by pre-training restricted Boltzmann machines (RBM) and fine-tuned by...
متن کاملOff-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
In order to facilitate the entry of data into the computer and its digitalization, automatic recognition of printed texts and manuscripts is one of the considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
متن کاملJoint Optimization of Denoising Autoencoder and DNN Acoustic Model Based on Multi-Target Learning for Noisy Speech Recognition
Denoising autoencoders (DAEs) have been investigated for enhancing noisy speech before feeding it to the back-end deep neural network (DNN) acoustic model, but there may be a mismatch between the DAE output and the expected input of the back-end DNN, and also inconsistency between the training objective functions of the two networks. In this paper, a joint optimization method of the front-end D...
متن کاملA deep-learning based native-language classification by using a latent semantic analysis for the NLI Shared Task 2017
This paper proposes a deep-learning based native-language identification (NLI) using a latent semantic analysis (LSA) as a participant (ETRI-SLP) of the NLI Shared Task 2017 (Malmasi et al., 2017) where the NLI Shared Task 2017 aims to detect the native language of an essay or speech response of a standardized assessment of English proficiency for academic purposes. To this end, we use the six ...
متن کاملAttribute knowledge integration for speech recognition based on multi-task learning neural networks
It has been demonstrated that the speech recognition performance can be improved by adding extra articulatory information, and subsequently, how to use such information effectively becomes a challenging problem. In this paper, we propose an attribute-based knowledge integration architecture which is realized by modeling and learning both acoustic and articulatory cues simultaneously in a unifor...
متن کامل